Enhancing Feature Space through Scraping for Gemeente Amsterdam
GitHub

Jan 2022 ~ MSc Course "Data Systems Project"

Length: 1mo (at 1.0 FTE)

Programming language: Python (Pandas, requests, Beautiful Soup, re, NLTK, math, datetime, GeoPandas, SciPy, scikit-learn)

Software: Tableau

Data:
- BAG dataset (Basisregistratie Adressen en Gebouwen, the basic registration of addresses and buildings), containing information about every building in Amsterdam, such as its address, neighborhood, and function
- Unstructured data on the Internet

Problem description:
Enhance the feature space for fire effect modeling through web scraping, and design a dashboard to visualize the results.

Approach & Results:
The municipality of Amsterdam is responsible for scheduling fire safety inspections of the city's buildings. Because there are over 500,000 buildings in the city, inspections have to be prioritized, so the municipality created a ranking that sorts all properties in descending order of risk score, defined as chance × effect. During one of the meetings with Gemeente Amsterdam, a gap in the effect score was identified. The proposed solution therefore builds on the idea that unstructured information publicly available online can make the effect score more accurate.
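The ranking described above can be sketched in a few lines. This is only an illustration, not the municipality's implementation: the building identifiers and the chance and effect values are invented, and only the product chance × effect and the descending sort come from the description.

```python
# Hypothetical buildings with made-up chance and effect values.
buildings = [
    {"id": "A", "chance": 0.25, "effect": 4.0},
    {"id": "B", "chance": 0.5, "effect": 8.0},
    {"id": "C", "chance": 0.5, "effect": 1.0},
]


def rank_by_risk(rows):
    """Return rows sorted in descending order of risk = chance * effect."""
    scored = [{**row, "risk": row["chance"] * row["effect"]} for row in rows]
    return sorted(scored, key=lambda row: row["risk"], reverse=True)


ranking = rank_by_risk(buildings)
for row in ranking:
    print(row["id"], row["risk"])
```

A more accurate effect value for a building changes its position in this ranking directly, which is why the project concentrates on enriching the effect side.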

Diagram Infrastructure

The infrastructure of the proposed system is shown in the diagram above. The system consists of two main parts: scraping and visualization. The scraping part starts by extracting the names of public assets within Amsterdam from Wikipedia; these form the rows of the first dataset. Since these assets are public, it was assumed that their online popularity reflects their real-world popularity. Extra features were therefore scraped from Wikipedia, Tripadvisor, Google, and Flickr using Beautiful Soup to represent the interest in each asset. This process is depicted in the upper branch of the diagram, following the black arrows into the Monuments (POI) dataset.
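As a rough sketch of the first scraping step, the snippet below pulls asset names out of an HTML list with Beautiful Soup. The HTML fragment, the `poi-list` id, and the `extract_asset_names` helper are assumptions made for illustration; the actual page markup and the selectors used in the project are not given here. In practice the HTML would come from an HTTP response (for example via requests) rather than from an inline string.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded page.
SAMPLE_HTML = """
<ul id="poi-list">
  <li><a href="/wiki/Rijksmuseum">Rijksmuseum</a></li>
  <li><a href="/wiki/Anne_Frank_House">Anne Frank House</a></li>
  <li>No link here</li>
</ul>
"""


def extract_asset_names(html):
    """Return the link text of every list item that contains a link."""
    soup = BeautifulSoup(html, "html.parser")
    listing = soup.find("ul", id="poi-list")
    if listing is None:
        return []
    names = []
    for item in listing.find_all("li"):
        link = item.find("a")
        if link is not None:
            names.append(link.get_text(strip=True))
    return names


asset_names = extract_asset_names(SAMPLE_HTML)
print(asset_names)
```

Each extracted name would then become one row of the POI dataset, to which the popularity features from the other sources are joined.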

Since non-public buildings are also important, their addresses were taken from the BAG dataset, and the value of each feature was computed by aggregating the corresponding variables of the public assets in their vicinity. This is shown by the blue arrows in the diagram. Finally, the two derived datasets are visualized in a Tableau dashboard that lets the user apply various filters and, if desired, give specific attributes more weight.
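The vicinity-based aggregation for BAG addresses can be illustrated as follows. This is a minimal sketch under stated assumptions: the 500 m radius, the use of a plain sum, the haversine distance, and all coordinates and feature values are hypothetical, since the text does not specify the radius or the aggregation function that was used.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def aggregate_nearby(building, pois, feature, radius_m=500):
    """Sum one feature over all POIs within radius_m of the building."""
    return sum(
        poi[feature]
        for poi in pois
        if haversine_m(building["lat"], building["lon"],
                       poi["lat"], poi["lon"]) <= radius_m
    )


# Hypothetical points of interest with an invented popularity feature.
pois = [
    {"name": "POI near", "lat": 52.3600, "lon": 4.8852, "reviews": 100},
    {"name": "POI also near", "lat": 52.3610, "lon": 4.8860, "reviews": 50},
    {"name": "POI far", "lat": 52.4000, "lon": 4.9500, "reviews": 999},
]
building = {"address": "Hypothetical address", "lat": 52.3605, "lon": 4.8855}

nearby_reviews = aggregate_nearby(building, pois, "reviews")
print(nearby_reviews)
```

With GeoPandas, the same idea could be expressed as a spatial join between buffered building points and the POI layer; the pure-Python version is used here only to keep the example self-contained.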

The final dashboard can be accessed at:
https://public.tableau.com/app/profile/fabian4248/viz/GroupD1_16439031396800/DOCUMENTATION?publish=yes
